Published

Monday, 08/01/2024

1 Data Visualization with Base R

  • Data visualisation is important for understanding our data
  • Base R has good graphing capabilities

1.1 Practical

  • Generate some random numbers, e.g. from a normal distribution with the rnorm(_) function
Code
rand_ds <- data.frame(x = rnorm(1000, mean = 10, sd = 1), 
                      y = rnorm(1000, mean = 5, sd = 2))
rand_ds
  • Plot a scatterplot using the plot(_) function
Code
plot(x = rand_ds$x, y = rand_ds$y)
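As a further illustration of base R graphics, the sketch below adds axis labels to the scatterplot and draws a histogram and boxplots from the same kind of simulated data. The object name rand_demo, the seed, and the label text are illustrative choices, not part of the original practical.

```r
# Minimal sketch with simulated data; rand_demo, the seed and the labels are illustrative
set.seed(123)
rand_demo <- data.frame(x = rnorm(1000, mean = 10, sd = 1),
                        y = rnorm(1000, mean = 5, sd = 2))

# Scatterplot with axis labels and a title
plot(x = rand_demo$x, y = rand_demo$y,
     xlab = "x (mean 10, sd 1)", ylab = "y (mean 5, sd 2)",
     main = "Scatterplot of simulated data")

# Histogram of a single variable
hist(rand_demo$x, main = "Distribution of x", xlab = "x")

# Boxplots of both columns side by side
boxplot(rand_demo, main = "x and y")
```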

2 Data Visualization with ggplot2

  • Various packages offer powerful graphing capabilities; the most famous is the ggplot2 package
  • ggplot2 was initially developed independently, but was later harmonised with the tidyverse packages
  • It is based on the “grammar of graphics” philosophy:
    • specifying datasets,
    • aesthetic mappings,
    • geometric objects,
    • statistical transformations,
    • etc: scales, coordinate systems, and facets
Code
library(ggplot2)

ggplot(rand_ds, aes(x, y)) + geom_point()
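To make the grammar of graphics components listed above concrete, here is a minimal sketch in which each component appears as one line of a single plot. It uses the built-in mtcars dataset rather than the workshop data, and the axis label and y-limits are arbitrary choices for illustration.

```r
library(ggplot2)

# Each line corresponds to one component of the grammar of graphics
p <- ggplot(data = mtcars,                     # dataset
            mapping = aes(x = wt, y = mpg)) +  # aesthetic mappings
  geom_point() +                               # geometric object
  stat_smooth(method = "lm", formula = y ~ x) + # statistical transformation
  scale_x_continuous(name = "Weight") +        # scale
  coord_cartesian(ylim = c(10, 35)) +          # coordinate system
  facet_wrap(~ cyl)                            # facets
p
```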

2.1 Practical

In this practical, we will use the asthmads_clean.sav data from the dataset folder

  1. Load related packages
  2. Import dataset
  • Create a new column to calculate the weight difference
Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(haven)

asthmads_clean <- read_sav("asthmads_clean.sav") %>% 
  as_factor() %>% 
  mutate(Wt_Diff = Weight_Post - Weight_Pre)
asthmads_clean
  1. With ggplot2, you begin a plot with the function ggplot(_), defining a plot object that you then add layers to.
Code
ggplot(data = asthmads_clean)

Code
asthmads_clean %>% 
  ggplot(data = .)

  1. Next, we need to tell ggplot(_) how the information from our data will be visually represented.
  • The mapping argument defines how variables are mapped to visual properties (aesthetics) of the plot.
  • The mapping argument is always defined with the aes(_) function.
  • The x and y arguments of aes(_) specify which variables to map to the x and y axes.
Code
asthmads_clean %>% 
  ggplot(data = ., 
         mapping = aes(x = PA_HW, 
                       y = Wt_Diff))

  1. Our “empty canvas” now has more structure: Physical Activity on the x-axis and Weight Difference on the y-axis
  2. We then need to define a geometric object (geom) that the plot uses to represent the data
  • In ggplot2, these functions start with geom_
  • To draw a scatter plot, we use the geom_point(_) function
Code
asthmads_clean %>% 
  ggplot(data = ., 
         mapping = aes(x = PA_HW, 
                       y = Wt_Diff)) + 
  geom_point()

  1. We can add further aesthetics and layers
  • For example, we want to distinguish the points by gender
  • To do this, we modify the aesthetic mapping (rather than the geom)
Code
asthmads_clean %>% 
  ggplot(data = ., 
         mapping = aes(x = PA_HW, 
                       y = Wt_Diff,
                       colour = Gender)) + 
  geom_point()
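The difference between mapping an aesthetic to a variable and setting it to a constant can be sketched as follows. This example uses the built-in mtcars dataset, and the colour "steelblue" is an arbitrary choice: inside aes(_) the colour varies with the data and a legend is drawn, whereas outside aes(_) every point gets the same fixed colour.

```r
library(ggplot2)

# Mapped: colour depends on a variable, so a legend is drawn
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# Set: one constant colour for all points, so no legend is needed
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "steelblue")

p_mapped
p_set
```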

3 Other Charts

3.1 Bar Chart

  • suitable for categorical data
Code
asthmads_clean %>% 
  ggplot(aes(Gender)) + 
  geom_bar()

  • Add colour by mapping the fill aesthetic
Code
asthmads_clean %>% 
  ggplot(aes(x = Gender, fill = Gender)) + 
  geom_bar()
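geom_bar(_) counts the rows in each category for us. When the counts are already tabulated, geom_col(_) plots them directly. The sketch below uses a small made-up toy table (not the asthma dataset) to show the equivalent two-step approach.

```r
library(ggplot2)
library(dplyr)

# Toy data, invented for illustration only
toy <- tibble(Gender = c("Male", "Female", "Female", "Male", "Female"))

# Step 1: tabulate the categories ourselves
gender_n <- toy %>%
  count(Gender)

# Step 2: plot the pre-computed counts with geom_col()
p <- gender_n %>%
  ggplot(aes(x = Gender, y = n, fill = Gender)) +
  geom_col()
p
```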

3.2 Histogram

  • For numerical variables, we use a histogram instead of a bar chart
  • It is commonly used to visualise the distribution of the data
Code
asthmads_clean %>% 
  ggplot(aes(Weight_Pre)) + 
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • We can adjust the fill and outline colour to make the bars clearer
Code
asthmads_clean %>% 
  ggplot(aes(x = Weight_Pre)) + 
  geom_histogram(fill = "white", colour = "black")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • We can also set the bin width
Code
asthmads_clean %>% 
  ggplot(aes(x = Weight_Pre)) + 
  geom_histogram(fill = "white", colour = "black", binwidth = 2)
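There are two ways to control the binning: binwidth gives the width of each bar in the units of the variable, while bins gives the total number of bars. A minimal sketch with simulated weights follows; the data, the seed, and the values 15 and 5 are illustrative assumptions.

```r
library(ggplot2)

# Simulated weights, for illustration only
set.seed(1)
wt_demo <- data.frame(weight = rnorm(200, mean = 60, sd = 10))

# Fix the number of bins
p_bins <- ggplot(wt_demo, aes(x = weight)) +
  geom_histogram(bins = 15, fill = "white", colour = "black")

# Fix the width of each bin (here, 5 units of weight)
p_width <- ggplot(wt_demo, aes(x = weight)) +
  geom_histogram(binwidth = 5, fill = "white", colour = "black")

p_bins
p_width
```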

3.3 Line chart (1)

  • A line chart is usually used to visualise a trend
  • We use a newly created dataset in this example
Code
time_ds <- tibble(time = 1:10, 
                  value = c(2, 3, 5, 7, 8, 9, 10, 12, 14, 15))
time_ds
  • Here, we want to plot value against time
Code
time_ds %>% 
  ggplot(aes(x = time, y = value)) +
  geom_line()

3.4 Line chart (2)

  • Create a line chart visualising the weight changes
  • However, we first need to wrangle the data into long format
Code
asthmads_clean %>% 
  select(idR, Gender, Weight_Pre, Weight_Post) %>% 
  pivot_longer(cols = starts_with("Weight"), 
               names_to = "event",
               values_to = "weight")
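To see what the long format looks like, here is a minimal sketch of the same pivot_longer(_) call on a three-row toy table. The values are made up; only the column names follow the practical.

```r
library(dplyr)
library(tidyr)

# Toy wide-format data: one row per participant (values invented for illustration)
wide_demo <- tibble(idR = 1:3,
                    Weight_Pre = c(60, 72, 55),
                    Weight_Post = c(58, 70, 56))

# Long format: one row per participant per measurement
long_demo <- wide_demo %>%
  pivot_longer(cols = starts_with("Weight"),
               names_to = "event",
               values_to = "weight")
long_demo
```

Each participant now contributes two rows, one for each weight column, with the former column name stored in event.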
  • Add geom_point(_)
  • Add geom_line(_)
    • Add the group aesthetic, so that the lines are “grouped” by idR
Code
asthmads_clean %>% 
  select(idR, Gender, Weight_Pre, Weight_Post) %>% 
  pivot_longer(cols = starts_with("Weight"), 
               names_to = "event",
               values_to = "weight") %>% 
  ggplot(aes(x = event,
             y = weight)) +
  geom_point() + 
  geom_line(aes(group = idR))

  • We can also colour by gender by adjusting the aes(_)
Code
asthmads_clean %>% 
  select(idR, Gender, Weight_Pre, Weight_Post) %>% 
  pivot_longer(cols = starts_with("Weight"), 
               names_to = "event",
               values_to = "weight") %>% 
  ggplot(aes(x = event,
             y = weight, 
             colour = Gender)) +
  geom_point() + 
  geom_line(aes(group = idR))
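One thing to watch for: event is a character column, so ggplot2 orders the x-axis alphabetically and Weight_Post is drawn to the left of Weight_Pre. A possible fix, sketched here on made-up toy data, is to convert event to a factor with the levels in chronological order.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy data, invented for illustration only
wide_demo <- tibble(idR = 1:3,
                    Weight_Pre = c(60, 72, 55),
                    Weight_Post = c(58, 70, 56))

long_demo <- wide_demo %>%
  pivot_longer(cols = starts_with("Weight"),
               names_to = "event",
               values_to = "weight") %>%
  # Set the level order explicitly so that Pre comes before Post on the x-axis
  mutate(event = factor(event, levels = c("Weight_Pre", "Weight_Post")))

p <- long_demo %>%
  ggplot(aes(x = event, y = weight)) +
  geom_point() +
  geom_line(aes(group = idR))
p
```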

4 Bonus: Malaysia Map

  • Load the prevalence of known diabetes mellitus (DM) among Malaysians (data source: NHMS 2019 report, page 36)
Code
nhms19_adm <- read_csv("../dataset/nhms19_adm.csv")
Rows: 16 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): State
dbl (3): Prevalence, Upper_CI, Lower_CI

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
nhms19_adm
  • Download the GeoJSON file from the DOSM GitHub repository
Code
download.file(
  url = "https://raw.githubusercontent.com/dosm-malaysia/data-open/main/datasets/geodata/administrative_1_state.geojson",
  destfile = "administrative_1_state.geojson",
  mode = "wb")
  • Load the sf package
Code
library(sf)
Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
  • Harmonise (wrangle) the data
    • Load the map
Code
my_state_sf <- read_sf("administrative_1_state.geojson")
my_state_sf
  • Combine the prevalence dataset with the map dataset
Code
left_join(nhms19_adm, my_state_sf)
  • An adjustment may be needed because the column names differ (State in the prevalence data, state in the map data)
Code
my_state_sf <- my_state_sf %>% 
  rename(State = state)
my_state_sf
Code
full_join(nhms19_adm, my_state_sf)
Joining with `by = join_by(State)`
  • The data didn't join for the three WP (federal territory) rows, because their names are written differently in the two datasets. A further adjustment is needed
Code
my_state_sf <- my_state_sf %>% 
  mutate(State = fct_recode(State,
                            "WP Kuala Lumpur" = "W.P. Kuala Lumpur", 
                            "WP Putrajaya" = "W.P. Putrajaya", 
                            "WP Labuan" = "W.P. Labuan"))

nhms19_adm_m <- full_join(nhms19_adm, my_state_sf)
Joining with `by = join_by(State)`
Code
nhms19_adm_m
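Mismatched keys like the WP names above can be found systematically with anti_join(_), which returns the rows of the first table that have no match in the second. The sketch below uses two tiny made-up tables; the prevalence values are invented and are not NHMS figures.

```r
library(dplyr)

# Toy tables with one deliberately mismatched key (values are invented)
prev_demo <- tibble(State = c("Johor", "WP Kuala Lumpur"),
                    Prevalence = c(10, 12))
map_demo <- tibble(State = c("Johor", "W.P. Kuala Lumpur"))

# Rows of the prevalence table with no matching state in the map table
unmatched_prev <- anti_join(prev_demo, map_demo, by = "State")
unmatched_prev

# Rows of the map table with no matching state in the prevalence table
unmatched_map <- anti_join(map_demo, prev_demo, by = "State")
unmatched_map
```

Running both directions shows the two spellings side by side, which makes it clear what needs recoding.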
  • Plot with the geom_sf(_) function
Code
nhms19_adm_m %>% 
  ggplot(aes(fill = Prevalence)) + 
  geom_sf()
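  • It seems our data frame is not an sf object: full_join(_) returns the same type as its first argument, which here is a tibble. We convert it with st_as_sf(_)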
Code
nhms19_adm_sf <- st_as_sf(nhms19_adm_m)
nhms19_adm_sf
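An alternative to converting afterwards is to put the sf object first in the join, since sf provides its own methods for the dplyr join verbs. This is a minimal sketch with a toy two-point map, not the DOSM data; the state names "A" and "B" and the prevalence values are invented.

```r
library(dplyr)
library(sf)

# Toy sf object with two point geometries (invented for illustration)
map_demo <- st_sf(State = c("A", "B"),
                  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1))))
prev_demo <- tibble(State = c("A", "B"), Prevalence = c(10, 12))

# Tibble first: the result follows the type of the first argument
tbl_first <- left_join(prev_demo, map_demo, by = "State")

# sf object first: the sf join method is used, so the class should be kept
sf_first <- left_join(map_demo, prev_demo, by = "State")

inherits(tbl_first, "sf")
inherits(sf_first, "sf")
```

With this ordering, rows that exist only in the prevalence table would be dropped by left_join(_), so the choice between the two approaches depends on which rows need to be kept.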
  • Plot again with the geom_sf(_) function, now using the sf object
Code
nhms19_adm_sf %>% 
  ggplot(aes(fill = Prevalence)) + 
  geom_sf()

  • We can also change the colour scale so the map looks like a heat map
Code
nhms19_adm_sf %>% 
  ggplot(aes(fill = Prevalence)) + 
  geom_sf() + 
  scale_fill_gradient(low = "green", high = "red")

  • We can also remove the axis text, titles, ticks, grid lines, and panel border
Code
nhms19_adm_sf %>% 
  ggplot(aes(fill = Prevalence)) + 
  geom_sf() + 
  scale_fill_gradient(low = "green", high = "red") +
  theme_bw() +
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        panel.border = element_blank())
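A shorter route to a similar result is theme_void(_), a built-in ggplot2 theme that removes the axes, grid, and background in one call. The sketch below uses a toy sf object made of two buffered points, with invented prevalence values, as a stand-in for the state map.

```r
library(ggplot2)
library(sf)

# Toy sf object: two circular polygons with made-up prevalence values
demo_sf <- st_sf(Prevalence = c(8, 12),
                 geometry = st_sfc(st_buffer(st_point(c(0, 0)), 1),
                                   st_buffer(st_point(c(3, 0)), 1)))

p <- demo_sf %>%
  ggplot(aes(fill = Prevalence)) +
  geom_sf() +
  scale_fill_gradient(low = "green", high = "red") +
  theme_void()   # removes axes, grid lines and panel background in one step
p
```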